Accessing UK Air Quality and Meteorological Data with {openair} and {worldmet}

Published

December 5, 2022

Abstract
Often the most difficult step in data analysis is getting hold of the data! Analysts often have to deal with messy, inconsistent data, and end up spending more of their time beating it into shape than gaining insight. The openair and worldmet packages provide a consistent method of accessing air quality and meteorological data from around the UK, and this document details how they can be used effectively alongside some other R tools.

1 Introduction

Often the greatest obstacle to carrying out effective data analysis is accessing the data itself. We’ll frequently want to synthesise multiple data sets from different air quality monitoring or meteorological measurement stations, which may be easier said than done! Consider the potential steps to do so:

  1. First, we need to find out about the measurement site(s) available to us — where are they? How many are in the area we’re interested in? What kind of sites are they (e.g., roadside, urban, rural)?

  2. We are likely interested in many sites, so we’d have to seek out multiple sets of data. Where are these data all stored? What format are they stored in? Are they consistently formatted?

  3. We are likely also interested in multiple years of data, so once again we’d have to download each year separately. Did the formatting change year-on-year? Did the site start/stop running in the time period we’re interested in?

  4. We’d need to bind these data together, which may be difficult if columns were inconsistently ordered or named.

  5. We’d then need to find a nearby met site and time-align its measurements with our air quality data. Where is the nearest met station? How does it store its data?

  6. Finally, if we want to expand our analysis to other sites or time periods, we’d need to start this whole process again!

Thankfully, the extended openair family – {openair}, {worldmet} and {openairmaps} – significantly streamlines this process:

  • {openair} exports functions like importAURN(), which allow you to import air quality measurements and statistics in a consistent way.

  • {worldmet} is a smaller package which contains tools to identify met stations and import met data.

  • {openairmaps} can visualise AQ monitoring networks, which allows you to discover measurement sites in a more visual, interactive way.

Ensure you have these three packages installed and loaded before you continue. Also install and load the tidyverse packages dplyr and stringr, which will be used later to assist with the data importing process.
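One way to set this up (install.packages() only needs running once per machine; library() is needed in every new session):

```r
# install the packages (only needed once)
install.packages(c("openair", "worldmet", "openairmaps", "dplyr", "stringr"))

# load them for this session
library(openair)
library(worldmet)
library(openairmaps)
library(dplyr)
library(stringr)
```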

2 Accessing Air Quality Data

2.1 Obtaining Site Metadata

The first function you will likely use is importMeta(), which returns site metadata information. Supported networks include the UK Automatic Urban and Rural Network (AURN), the devolved networks (Air Quality England and the Scottish, Welsh and Northern Ireland networks), and networks managed by English local authorities. By default, basic site information is returned — the site name, code and type, and its latitude and longitude.

importMeta(source = "aurn")

For more complete metadata, the all argument can be set to TRUE. This additionally returns information about the individual pollutants measured at each site and when they started (and, if appropriate, stopped) being measured. Where available, you’ll also receive more context about the site (zone, agglomeration, local authority, etc.).

importMeta(source = "aurn", all = TRUE)

The key piece of information you’ll want to take from importMeta() is the site code(s) of interest, which can be used in {openair}’s other “import” functions. To begin with, this can be done manually; for example, you could View() the data frame in RStudio and copy out the sites of interest.

site_meta <- importMeta(source = "aurn")

View(site_meta)

Alternatively, you could use openairmaps::networkMap() to view the sites in their geographical context. Clicking on a marker will show you all of the information presented by importMeta(all = TRUE) in a popup.

networkMap(source = "aurn")

Figure 1: A networkMap() of the AURN plotted using {openairmaps}.

One of the benefits of using networkMap() is that multiple networks can be visualised at once. This can be useful in a few different ways, but most commonly when you are interested in air quality data from a particular geographic region but aren’t overly concerned about its specific origin. For example, Figure 2 shows the AURN, AQE network and English locally managed networks on one map.

networkMap(source = c("aurn", "aqe", "local"))

Figure 2: A networkMap() of the AURN, AQE network and locally managed networks.

Try some of the options …

There are different networks to consider and, usefully, the date option lets you see what sites were available at any point in history. Try experimenting with the networkMap() function and see if you can look back at what the network(s) looked like in 1990…
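As a sketch, a call using the date option mentioned above might look like the following (check ?networkMap for the exact date format your installed version expects):

```r
# map the AURN sites that were operating at the start of 1990
networkMap(source = "aurn", date = "1990-01-01")
```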

Once you have the site codes of interest, you can move onto importing your data of choice.

2.2 Importing Air Quality Data

Now armed with the relevant site codes, you can use the corresponding “import” function to access your data of choice. As we have been looking at the AURN so far, the relevant function for us is importAURN(). The important arguments to provide are:

  • site: the site code(s) of interest, obtained from importMeta().

  • year: the year(s) of interest. You can provide:

      • a single year (e.g., 2020)

      • multiple individual years (e.g., c(2010, 2015, 2020))

      • a range of years (e.g., 2015:2020)

For example, the below code chunk imports air quality data for the two measurement sites in York for the years 2015 to 2020.

importAURN(
  site = c("YK10", "YK11"),
  year = 2015:2020
)
# A tibble: 105,216 × 15
   site      code  date                  nox   no2    no  pm10 pm2.5   v10  v2.5
   <chr>     <chr> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 York Boo… YK10  2015-01-01 00:00:00    NA    NA    NA  29.7  26.6   8.1   6.7
 2 York Boo… YK10  2015-01-01 01:00:00    NA    NA    NA  25.9  15.4   4.9   2.8
 3 York Boo… YK10  2015-01-01 02:00:00    NA    NA    NA  17.4  12.7   9     4.5
 4 York Boo… YK10  2015-01-01 03:00:00    NA    NA    NA  16.9  14     4.7   6.7
 5 York Boo… YK10  2015-01-01 04:00:00    NA    NA    NA  15.7  16.7   3.8   4.2
 6 York Boo… YK10  2015-01-01 05:00:00    NA    NA    NA  11.5   6.7   2.7   0.4
 7 York Boo… YK10  2015-01-01 06:00:00    NA    NA    NA  15.4  10.6   5.1   4.4
 8 York Boo… YK10  2015-01-01 07:00:00    NA    NA    NA  14.4  14     4.4   5  
 9 York Boo… YK10  2015-01-01 08:00:00    NA    NA    NA  13    12.1   3.2   4.9
10 York Boo… YK10  2015-01-01 09:00:00    NA    NA    NA  16.6  11.3   5     2.8
# … with 105,206 more rows, and 5 more variables: nv10 <dbl>, nv2.5 <dbl>,
#   ws <dbl>, wd <dbl>, air_temp <dbl>

importAURN() and other similar functions can also be used in more advanced ways. Other arguments include:

  • pollutant: the pollutant(s) of interest. You can provide:

      • a single pollutant (e.g., "nox")

      • multiple pollutants (e.g., c("nox", "no2"))

      • nothing at all! (this imports all pollutants measured by the site(s))

  • meta: when TRUE this will also return the site latitude, longitude and type.

  • to_narrow: when TRUE this will change the structure of the resulting data frame to a “narrow” or “tidy” format. Much has been written about “tidy” data [@WickhamTidy2014], but this argument effectively stacks all of the pollutant columns on top of one another.

  • hc: when TRUE any hydrocarbon data is also imported. This is FALSE by default as relatively few sites measure hydrocarbons, most users won’t be interested in using the data, and the resulting imported data frame is considerably larger with hydrocarbons included.

The below code chunk once again imports York air quality data from 2015 to 2020, but this time only NO2, NOx and PM10 data in a narrow/tidy format, appended with site metadata.

importAURN(
  site = c("YK10", "YK11"),
  pollutant = c("no2", "nox", "pm10"),
  year = 2015:2020,
  meta = TRUE, 
  to_narrow = TRUE
)
# A tibble: 315,648 × 8
   date                site         code  latitude longi…¹ site_…² pollu…³ value
   <dttm>              <chr>        <chr>    <dbl>   <dbl> <chr>   <chr>   <dbl>
 1 2015-01-01 00:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 2 2015-01-01 01:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 3 2015-01-01 02:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 4 2015-01-01 03:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 5 2015-01-01 04:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 6 2015-01-01 05:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 7 2015-01-01 06:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 8 2015-01-01 07:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
 9 2015-01-01 08:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
10 2015-01-01 09:00:00 York Bootham YK10      54.0   -1.09 Urban … no2        NA
# … with 315,638 more rows, and abbreviated variable names ¹​longitude,
#   ²​site_type, ³​pollutant

One last argument which hasn’t been discussed is data_type, which controls the type of data returned by the import function. By default, this is “hourly” averaged data, but there are a few others to consider:

  • "daily", "monthly" and "annual" are all alternative averaging periods, with the latter two also returning data capture information.

  • "15_min" pertains to SO2, and returns 15-minute average concentrations.

  • "8_hour" and "daily_max_8" both pertain to CO and O3, and return rolling 8 hour means and maxima respectively.

  • "daqi" returns Defra’s Daily Air Quality Index (DAQI). More information about this statistic is available here.

Consider the structure of the data returned when data_type = "annual". There is a column for each pollutant, alongside a separate *_capture column which shows the fraction of data captured over the year (0–1). data_type = "annual" pairs well with to_narrow = TRUE, which returns three key columns - “species” for the pollutant name, “value” for the annual mean, and “data_capture” for the data capture rate.

importAURN(
  data_type = "annual",
  year = 2020
)
# A tibble: 172 × 38
   uka_c…¹ code  site  date                   o3 o3_ca…² o3.su…³ o3.da…⁴ o3.ao…⁵
   <fct>   <chr> <chr> <dttm>              <dbl>   <dbl>   <dbl>   <dbl>   <int>
 1 UKA003… ABD   Aber… 2020-01-01 00:00:00  45.5   0.610   0.718    57.5      NA
 2 UKA005… ABD7  Aber… 2020-01-01 00:00:00  NA    NA      NA        NA        NA
 3 UKA006… ABD8  Aber… 2020-01-01 00:00:00  NA    NA      NA        NA        NA
 4 UKA004… ACTH  Auch… 2020-01-01 00:00:00  58.9   0.992   0.993    67.2    2038
 5 UKA005… AGRN  Birm… 2020-01-01 00:00:00  45.6   0.586   0.345    57.3      NA
 6 UKA001… AH    Asto… 2020-01-01 00:00:00  66.2   0.988   0.983    74.9    5717
 7 UKA005… ARM6  Arma… 2020-01-01 00:00:00  NA    NA      NA        NA        NA
 8 UKA006… BAAR  Ball… 2020-01-01 00:00:00  NA    NA      NA        NA        NA
 9 UKA005… BALM  Ball… 2020-01-01 00:00:00  NA    NA      NA        NA        NA
10 UKA003… BAR3  Barn… 2020-01-01 00:00:00  46.0   0.858   0.759    57.2      NA
# … with 162 more rows, 29 more variables: o3.aot40f <int>, somo35 <dbl>,
#   somo35_capture <dbl>, no <dbl>, no_capture <dbl>, no2 <dbl>,
#   no2_capture <dbl>, nox <dbl>, nox_capture <dbl>, so2 <dbl>,
#   so2_capture <dbl>, co <dbl>, co_capture <dbl>, pm10 <dbl>,
#   pm10_capture <dbl>, nv10 <dbl>, nv10_capture <dbl>, v10 <dbl>,
#   v10_capture <dbl>, pm2.5 <dbl>, pm2.5_capture <dbl>, nv2.5 <dbl>,
#   nv2.5_capture <dbl>, v2.5 <dbl>, v2.5_capture <dbl>, gr10 <dbl>, …
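As a sketch, the same annual request can be combined with to_narrow = TRUE to stack the pollutants and their capture rates into a tidy format:

```r
# annual statistics for 2020, stacked into a narrow/tidy format
importAURN(
  data_type = "annual",
  year = 2020,
  to_narrow = TRUE
)
```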
Try importing data in different ways

We’ll be accessing UK data over the next couple of days, so it is good to get a feel for the different options and how they affect what is imported and how it is formatted.

2.3 Automation with dplyr

When we write code, our goal should be to limit the number of “human steps” needed. There are lots of reasons to do so, but just in the realm of picking out sites:

  • Humans are fallible and end up doing things like misspelling site names, or missing relevant sites when manually searching a network.

  • Humans aren’t as quick as computers at doing pattern recognition.

  • If something needs to change, having human steps makes that change more painful. For example, a shift from considering background sites to traffic sites is a one word change in the code, but perhaps a ten minute job by a human!

We can think of the “hunt through the data frame to find the sites you are interested in” step as a gap in our script which we need to close. We need a programmatic way of getting from an importMeta() output to a list of sites to provide to importAURN(). This is what we need dplyr (and a bit of stringr) for!

The dplyr function we’re going to use is filter(), which uses logical operators to subset the rows of a data frame. That is quite an abstract description, so you can also think of it as a way to pull certain rows out of your data based on features of those rows. Let’s try this a few different ways.

First, let’s get all of the urban background sites. This uses the equality operator, ==. The expression below can be read as “in the site_meta data frame, filter for rows where site_type is equal to ‘Urban Background’”. This is case sensitive, so make sure you get your capitalisation correct!

site_meta <- importMeta()
filter(site_meta, site_type == "Urban Background")
# A tibble: 127 × 5
   site                 code  latitude longitude site_type       
   <chr>                <chr>    <dbl>     <dbl> <chr>           
 1 Aberdeen             ABD       57.2     -2.09 Urban Background
 2 Aberdeen Erroll Park ABD9      57.2     -2.09 Urban Background
 3 Ballymena Ballykeel  BALM      54.9     -6.25 Urban Background
 4 Barnsley             BARN      53.6     -1.48 Urban Background
 5 Barnsley 12          BAR2      53.6     -1.49 Urban Background
 6 Barnsley Gawber      BAR3      53.6     -1.51 Urban Background
 7 Belfast Centre       BEL2      54.6     -5.93 Urban Background
 8 Belfast East         BEL       54.6     -5.90 Urban Background
 9 Belfast South        BEL3      54.6     -5.91 Urban Background
10 Bircotes             BIR       53.4     -1.05 Urban Background
# … with 117 more rows

Perhaps we are interested in the AURN sites that measure CO and SO2. filter() can take multiple logical expressions, so here we filter for the variable column being either ‘CO’ or ‘SO2’ (using the %in% operator) and for end_date being ‘ongoing’.

site_meta_all <- importMeta(all = TRUE)
filter(site_meta_all, variable %in% c("CO", "SO2"), end_date == "ongoing")
# A tibble: 35 × 13
   code  site        site_…¹ latit…² longi…³ varia…⁴ Param…⁵ start_date         
   <chr> <chr>       <chr>     <dbl>   <dbl> <chr>   <chr>   <dttm>             
 1 BALM  Ballymena … Urban …    54.9   -6.25 SO2     Sulphu… 2010-01-01 00:00:00
 2 BAR3  Barnsley G… Urban …    53.6   -1.51 SO2     Sulphu… 1997-07-07 00:00:00
 3 BEL2  Belfast Ce… Urban …    54.6   -5.93 SO2     Sulphu… 1992-03-08 00:00:00
 4 BEL2  Belfast Ce… Urban …    54.6   -5.93 CO      Carbon… 1992-03-08 00:00:00
 5 BMLD  Birmingham… Urban …    52.5   -1.92 SO2     Sulphu… 2020-01-01 00:00:00
 6 CARD  Cardiff Ce… Urban …    51.5   -3.18 SO2     Sulphu… 1992-05-12 00:00:00
 7 CARD  Cardiff Ce… Urban …    51.5   -3.18 CO      Carbon… 1992-05-12 00:00:00
 8 CHBO  Chilbolton… Rural …    51.1   -1.44 SO2     Sulphu… 2016-01-11 00:00:00
 9 DERR  Derry Rose… Urban …    55.0   -7.33 SO2     Sulphu… 2016-03-21 00:00:00
10 ED3   Edinburgh … Urban …    55.9   -3.18 SO2     Sulphu… 2003-11-24 00:00:00
# … with 25 more rows, 5 more variables: end_date <chr>, ratified_to <dttm>,
#   zone <chr>, agglomeration <chr>, local_authority <chr>, and abbreviated
#   variable names ¹​site_type, ²​latitude, ³​longitude, ⁴​variable,
#   ⁵​Parameter_name

Earlier we also loaded stringr, a package that streamlines working with character data. The relevant function here is str_detect(), which can be used inside filter(). The below code can be read as “in the site_meta data frame, filter for rows where the string ‘Urban’ is present in site_type”. This effectively returns all of the Urban Traffic and Urban Background sites!

filter(site_meta, str_detect(site_type, "Urban"))
# A tibble: 235 × 5
   site                           code  latitude longitude site_type       
   <chr>                          <chr>    <dbl>     <dbl> <chr>           
 1 Aberdeen                       ABD       57.2     -2.09 Urban Background
 2 Aberdeen Erroll Park           ABD9      57.2     -2.09 Urban Background
 3 Aberdeen Union Street Roadside ABD7      57.1     -2.11 Urban Traffic   
 4 Aberdeen Wellington Road       ABD8      57.1     -2.09 Urban Traffic   
 5 Armagh Roadside                ARM6      54.4     -6.65 Urban Traffic   
 6 Ballymena Antrim Road          BAAR      54.9     -6.27 Urban Traffic   
 7 Ballymena Ballykeel            BALM      54.9     -6.25 Urban Background
 8 Barnsley                       BARN      53.6     -1.48 Urban Background
 9 Barnsley 12                    BAR2      53.6     -1.49 Urban Background
10 Barnsley Gawber                BAR3      53.6     -1.51 Urban Background
# … with 225 more rows

If we want to programmatically extract the York sites, we could go about it in a couple of different ways:

  • We could use filter(site_meta, str_detect(site, "York")) to extract all sites with “York” in the name.
  • We could use filter(site_meta_all, local_authority == "York") to filter for the local authority being ‘York’.

In R, there are often a few different ways to do the same thing!
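As a quick sketch, the first of those options can be taken one step further by pulling out the site codes themselves, ready to pass to importAURN() (this reuses the site_meta data frame created earlier):

```r
# extract all sites with "York" in the name...
york_meta <- filter(site_meta, str_detect(site, "York"))

# ...and pull out their site codes with $
york_meta$code
```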

To pull a lot of this together, the example below shows some code to import ozone and NO2 data for every site in the South East zone that is currently measuring them.

# get meta data
site_meta_all <- importMeta(all = TRUE)

# filter meta data
se_o3_no2_meta <- 
  filter(site_meta_all,
         zone == "South East",
         variable %in% c("O3", "NO2"),
         end_date == "ongoing")

# import (unique() removes duplicate site codes)
southeast_aq <- 
  importAURN(
    site = unique(se_o3_no2_meta$code),
    year = 2021,
    pollutant = c("o3","no2")
  )

# see summary
summary(southeast_aq)
      date                           o3              no2         
 Min.   :2021-01-01 00:00:00   Min.   : -0.60   Min.   : -1.324  
 1st Qu.:2021-04-02 05:45:00   1st Qu.: 35.17   1st Qu.:  6.799  
 Median :2021-07-02 11:30:00   Median : 52.14   Median : 13.283  
 Mean   :2021-07-02 11:30:00   Mean   : 50.78   Mean   : 17.782  
 3rd Qu.:2021-10-01 17:15:00   3rd Qu.: 67.01   3rd Qu.: 24.652  
 Max.   :2021-12-31 23:00:00   Max.   :215.69   Max.   :162.889  
                               NA's   :95200    NA's   :15449    
     site               code          
 Length:157680      Length:157680     
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      
Try manipulating the imported data

Some of the ideas above will take a bit of getting used to — but they are very powerful. Even with a little bit of knowledge about how to filter data, it will be surprising just how much you can do. It is worth taking a bit of time to filter these data sets in different ways. We’ll build on these skills later in the course…

3 Accessing Meteorological Data

3.1 Importing Global Met Data

{worldmet} has two main functions for accessing meteorological data. First we’ll consider getMeta(), which lists met sites around the globe.

getMeta()
# A tibble: 13,404 × 13
   usaf   wban  station     ctry  st    call  latit…¹ longi…² elev(…³ begin     
   <chr>  <chr> <chr>       <chr> <chr> <chr>   <dbl>   <dbl>   <dbl> <date>    
 1 010060 99999 EDGEOYA     NO    <NA>  <NA>     78.2    22.8    14   1973-01-01
 2 010070 99999 NY-ALESUND  SV    <NA>  <NA>     78.9    11.9     7.7 1973-01-06
 3 010080 99999 LONGYEAR    SV    <NA>  ENSB     78.2    15.5    26.8 1975-09-29
 4 010090 99999 KARL XII O… SV    <NA>  <NA>     80.6    25       5   1955-01-01
 5 010100 99999 ANDOYA      NO    <NA>  ENAN     69.3    16.1    13.1 1931-01-03
 6 010110 99999 KVITOYA     SV    <NA>  <NA>     80.1    31.5    10   1986-11-18
 7 010150 99999 HEKKINGEN … NO    <NA>  <NA>     69.6    17.8    14   1980-03-14
 8 010160 99999 KONGSOYA    NO    <NA>  <NA>     78.9    28.9    20   1993-05-01
 9 010170 99999 AKSELOYA    SV    <NA>  <NA>     77.7    14.8     6   1973-01-01
10 010200 99999 SORKAPPOYA  SV    <NA>  <NA>     76.5    16.6    10   2010-10-08
# … with 13,394 more rows, 3 more variables: end <date>, code <chr>,
#   dist <lgl>, and abbreviated variable names ¹​latitude, ²​longitude,
#   ³​`elev(m)`

Much like openairmaps::networkMap(), worldmet::getMeta() can also return an interactive map:

getMeta(returnMap = TRUE)

Figure 3: Global met stations, obtained using worldmet::getMeta().

Often, you’ll not be sure which meteorological station is the “local” site to your air quality measurement site. To find your local met station, you can provide the lat and lon arguments to getMeta(), which will then return the n closest stations (defaulting to 10). The code chunk below shows the 5 closest met stations to the Marylebone Road air quality site.

site_met <- importMeta()
my1 <- filter(site_met, code == "MY1")

getMeta(lat = my1$latitude, lon = my1$longitude, n = 5)
# A tibble: 5 × 15
  usaf   wban  station      ctry  st    call  latit…¹ longi…² elev(…³ begin     
  <chr>  <chr> <chr>        <chr> <chr> <chr>   <dbl>   <dbl>   <dbl> <date>    
1 037700 99999 ST JAMES PA… UK    <NA>  <NA>     51.5  -0.117     5   2009-12-18
2 037683 99999 CITY         UK    <NA>  EGLC     51.5   0.055     5.8 1988-01-29
3 036720 99999 NORTHOLT     UK    <NA>  EGWU     51.6  -0.418    37.8 1973-01-01
4 037720 99999 HEATHROW     UK    <NA>  EGLL     51.5  -0.461    25.3 1948-12-01
5 037663 99999 BIGGIN HILL  UK    <NA>  EGKB     51.3   0.033   182.  1988-01-05
# … with 5 more variables: end <date>, code <chr>, longr <dbl>, latr <dbl>,
#   dist <dbl>, and abbreviated variable names ¹​latitude, ²​longitude,
#   ³​`elev(m)`

When you use lat and lon with returnMap = TRUE, you’ll notice that your AQ site is labelled with a red dot!

getMeta(lat = my1$latitude, lon = my1$longitude, n = 5, returnMap = TRUE)

Figure 4: The five closest met stations to the Marylebone Road air quality site.

Once you’ve decided on your met station, simply use its code (the usaf and wban identifiers joined with a hyphen) along with importNOAA() to access the met data. You’ll notice some arguments similar to those of importAURN(). The important one to concern yourself with is year, where you define the year(s) of data you are interested in.

worldmet::importNOAA(code = "037720-99999", year = 2020)
# A tibble: 8,784 × 25
   code    station date                latit…¹ longi…²  elev    ws    wd air_t…³
   <fct>   <fct>   <dttm>                <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 037720… HEATHR… 2020-01-01 00:00:00    51.5  -0.461  25.3 1.53  123.     3.4 
 2 037720… HEATHR… 2020-01-01 01:00:00    51.5  -0.461  25.3 2.07  118.     3.6 
 3 037720… HEATHR… 2020-01-01 02:00:00    51.5  -0.461  25.3 2.27  101.     4.4 
 4 037720… HEATHR… 2020-01-01 03:00:00    51.5  -0.461  25.3 2.6   100.     4.23
 5 037720… HEATHR… 2020-01-01 04:00:00    51.5  -0.461  25.3 2.9    90.8    4.2 
 6 037720… HEATHR… 2020-01-01 05:00:00    51.5  -0.461  25.3 3.27   87.4    4.47
 7 037720… HEATHR… 2020-01-01 06:00:00    51.5  -0.461  25.3 1.5    85      4.87
 8 037720… HEATHR… 2020-01-01 07:00:00    51.5  -0.461  25.3 1     196.     4.73
 9 037720… HEATHR… 2020-01-01 08:00:00    51.5  -0.461  25.3 0.833  96.9    3.7 
10 037720… HEATHR… 2020-01-01 09:00:00    51.5  -0.461  25.3 1.33   58.8    4.9 
# … with 8,774 more rows, 16 more variables: atmos_pres <dbl>,
#   visibility <dbl>, dew_point <dbl>, RH <dbl>, ceil_hgt <dbl>, cl_1 <dbl>,
#   cl_2 <dbl>, cl_3 <dbl>, cl <dbl>, cl_1_height <dbl>, cl_2_height <dbl>,
#   cl_3_height <dbl>, precip_12 <dbl>, precip_6 <dbl>, pwc <chr>,
#   precip <dbl>, and abbreviated variable names ¹​latitude, ²​longitude,
#   ³​air_temp

3.2 Combining Air Quality and Meteorological Data

We are now in the situation where we have multiple data frames we want to combine - one air quality data set and one meteorological one. “Joining” can be complicated, so let’s create some example data to demonstrate:

x <- tibble(
  country = c("UK", "USA", "France"),
  capital = c("London", "Washington DC", "Paris")
)

y <- tibble(
  country = c("UK", "France", "Germany"),
  language = c("English", "French", "German")
)

Have a look at the data in Table 1. Notice that they share a column - “country” - though the items in it are not consistent (the UK and France are in both, but one has the USA and the other Germany).

Table 1: The two dataframes. Notice the differences in their shared columns.

(a) Capital cities
country  capital
UK       London
USA      Washington DC
France   Paris
(b) Languages
country  language
UK       English
France   French
Germany  German

The tidyverse package dplyr contains “joining” functions. To familiarise yourself with them, read the “cheat sheet” available here. There are several different joining functions - notice the differences between them below in Table 2.

Table 2: The four different kinds of ‘mutating’ joins and the functions used to achieve them.

(a) left_join(x, y)
country  capital        language
UK       London         English
USA      Washington DC  NA
France   Paris          French
(b) right_join(x, y)
country  capital  language
UK       London   English
France   Paris    French
Germany  NA       German
(c) inner_join(x, y)
country  capital  language
UK       London   English
France   Paris    French
(d) full_join(x, y)
country  capital        language
UK       London         English
USA      Washington DC  NA
France   Paris          French
Germany  NA             German

Nine times out of ten, left_join() will do the job, so that’s what we’re going to use for our data. The common column between our two data frames is “date”, so that is what we will join by. You don’t have to explicitly tell a “join” function which column to join by - it will work it out - but it is good practice to be explicit.

To avoid data loss, it is a good idea for the first data frame listed to have the finest time resolution - join a daily data frame with a monthly one rather than the other way around. In this case both are hourly, so it doesn’t really matter which goes first.

The below chunk imports some air quality data for Marylebone road, some met data from Heathrow Airport, and then joins them together. Note that we use the select() function from dplyr to drop some of the met data; often we’re only interested in wind speed and direction.

aq <- importAURN(site = "my1", pollutant = c("no2", "nox", "pm2.5", "pm10"), year = 2020)
met <- worldmet::importNOAA(code = "037720-99999", year = 2020)
all <- left_join(select(met, date, ws, wd), aq, by = "date")
all
# A tibble: 8,784 × 9
   date                   ws    wd   no2   nox pm2.5  pm10 site            code 
   <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>           <chr>
 1 2020-01-01 00:00:00 1.53  123.   45.8  166.  58.1  69   London Maryleb… MY1  
 2 2020-01-01 01:00:00 2.07  118.   52.6  189.  43.2  45   London Maryleb… MY1  
 3 2020-01-01 02:00:00 2.27  101.   44.8  152.  43    46.2 London Maryleb… MY1  
 4 2020-01-01 03:00:00 2.6   100.   40.2  144.  42.8  45.1 London Maryleb… MY1  
 5 2020-01-01 04:00:00 2.9    90.8  47.3  158.  36.8  40.8 London Maryleb… MY1  
 6 2020-01-01 05:00:00 3.27   87.4  40.4  136.  39.2  44.1 London Maryleb… MY1  
 7 2020-01-01 06:00:00 1.5    85    42.7  148.  38.7  44.3 London Maryleb… MY1  
 8 2020-01-01 07:00:00 1     196.   40.7  181.  39.3  40.4 London Maryleb… MY1  
 9 2020-01-01 08:00:00 0.833  96.9  42.6  147.  37.3  37.5 London Maryleb… MY1  
10 2020-01-01 09:00:00 1.33   58.8  38.3  139.  34.3  34.6 London Maryleb… MY1  
# … with 8,774 more rows

Notice what has happened. As we joined by the “date” column, it sits at the far left. The met columns come next, as we provided that data frame first. Finally, the AQ data is on the right, as it was provided second.

If a row in the first data frame matches multiple rows in the second, it will be joined to all of them. This can make your resulting data frame longer than what you put in to begin with - something to be avoided. It is therefore a good idea to check that you haven’t gained any rows using a function like nrow().

nrow(aq)
[1] 8784
nrow(met)
[1] 8784
nrow(all)
[1] 8784
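If you want your script to fail loudly rather than silently carry duplicated rows, that check can be written as an assertion using base R’s stopifnot():

```r
# stop with an error if the join has changed the number of rows
stopifnot(nrow(all) == nrow(met))
```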